Regularizing and Optimizing LSTM Language Models
Authors
Stephen Merity, Nitish Shirish Keskar, Richard Socher
Abstract
In this paper, we consider the specific problem of word-level language modeling and investigate strategies for regularizing and optimizing LSTM-based models. We propose the weight-dropped LSTM, which uses DropConnect on hidden-to-hidden weights, as a form of recurrent regularization. Further, we introduce NT-AvSGD, a non-monotonically triggered (NT) variant of the averaged stochastic gradient method (AvSGD), wherein the averaging trigger is determined using a NT condition as opposed to being tuned by the user. Using these and other regularization strategies, our AvSGD Weight-Dropped LSTM (AWD-LSTM) achieves state-of-the-art word-level perplexities on two data sets: 57.3 on Penn Treebank and 65.8 on WikiText-2. In exploring the effectiveness of a neural cache in conjunction with our proposed model, we achieve an even lower state-of-the-art perplexity of 52.8 on Penn Treebank and 52.0 on WikiText-2. We also explore the viability of the proposed regularization and optimization strategies in the context of the quasi-recurrent neural network (QRNN) and demonstrate comparable performance to the AWD-LSTM counterpart. The code for reproducing the results is open sourced and available at https://github.com/salesforce/awd-lstm-lm.
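As a rough illustration of the weight-dropped LSTM idea, the sketch below applies DropConnect to the hidden-to-hidden weight matrix once per forward pass, so a single weight mask is shared across every time step of the sequence. This is a minimal PyTorch sketch under that reading; the class and parameter names (WeightDropLSTM, weight_dropout) are illustrative and not the released awd-lstm-lm implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightDropLSTM(nn.Module):
    """Sketch of DropConnect on the hidden-to-hidden LSTM weights (hypothetical names)."""

    def __init__(self, input_size, hidden_size, weight_dropout=0.5):
        super().__init__()
        self.hidden_size = hidden_size
        self.weight_dropout = weight_dropout
        # Stacked weights for the four LSTM gates: input, forget, cell, output.
        self.weight_ih = nn.Parameter(torch.randn(4 * hidden_size, input_size) * 0.1)
        self.weight_hh = nn.Parameter(torch.randn(4 * hidden_size, hidden_size) * 0.1)
        self.bias = nn.Parameter(torch.zeros(4 * hidden_size))

    def forward(self, x, state=None):
        # x: (batch, seq_len, input_size)
        batch, seq_len, _ = x.shape
        if state is None:
            h = x.new_zeros(batch, self.hidden_size)
            c = x.new_zeros(batch, self.hidden_size)
        else:
            h, c = state
        # DropConnect: drop individual recurrent *weights* rather than activations;
        # the dropped matrix is reused at every time step of this sequence.
        w_hh = F.dropout(self.weight_hh, p=self.weight_dropout, training=self.training)
        outputs = []
        for t in range(seq_len):
            gates = x[:, t] @ self.weight_ih.t() + h @ w_hh.t() + self.bias
            i, f, g, o = gates.chunk(4, dim=-1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            outputs.append(h)
        return torch.stack(outputs, dim=1), (h, c)
```

Masking the recurrent weights, rather than the hidden activations, regularizes the hidden-to-hidden transition without changing the LSTM equations themselves, which is what lets it serve as a drop-in form of recurrent regularization.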
Similar papers
Regularizing RNNs by Stabilizing Activations
We stabilize the activations of Recurrent Neural Networks (RNNs) by penalizing the squared distance between successive hidden states’ norms. This penalty term is an effective regularizer for RNNs including LSTMs and IRNNs, improving performance on character-level language modelling and phoneme recognition, and outperforming weight noise and dropout. We achieve state of the art performance (17.5...
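The penalty described above has a direct reading as a term added to the training loss. Below is a hypothetical helper, assuming the hidden states are collected into a (batch, seq_len, hidden_size) tensor and weighted by a coefficient beta; the name and shapes are assumptions for illustration, not the paper's code.

```python
import torch


def norm_stabilizer_penalty(hidden_states: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Penalize the squared difference between the L2 norms of successive hidden states."""
    norms = hidden_states.norm(dim=-1)    # per-time-step L2 norms: (batch, seq_len)
    diffs = norms[:, 1:] - norms[:, :-1]  # change in norm between adjacent steps
    return beta * diffs.pow(2).mean()     # added to the language-modelling loss
```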
Recurrent Neural Network Regularization
We present a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units. Dropout, the most successful technique for regularizing neural networks, does not work well with RNNs and LSTMs. In this paper, we show how to correctly apply dropout to LSTMs, and show that it substantially reduces overfitting on a variety of tasks. These tasks include la...
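The snippet above is truncated, but the recipe it refers to is applying dropout only to the non-recurrent connections of a stacked LSTM. A minimal sketch under that reading follows; the layer sizes and the class name NonRecurrentDropoutLM are illustrative, not the paper's code.

```python
import torch.nn as nn


class NonRecurrentDropoutLM(nn.Module):
    """Dropout only on non-recurrent connections: the embedding output, between
    stacked LSTM layers, and before the decoder; the recurrent path is untouched."""

    def __init__(self, vocab_size=10000, emb_size=256, hidden_size=256, p=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.drop = nn.Dropout(p)
        self.lstm1 = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.lstm2 = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):
        x = self.drop(self.embed(tokens))  # dropout on input to the first LSTM layer
        x, _ = self.lstm1(x)
        x = self.drop(x)                   # dropout between layers (non-recurrent)
        x, _ = self.lstm2(x)
        x = self.drop(x)                   # dropout before the softmax decoder
        return self.decoder(x)
```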
Multi-view and multi-task training of RST discourse parsers
We experiment with different ways of training LSTM networks to predict RST discourse trees. The main challenge for RST discourse parsing is the limited amounts of training data. We combat this by regularizing our models using task supervision from related tasks as well as alternative views on discourse structures. We show that a simple LSTM sequential discourse parser takes advantage of this mu...
Process Knowledge Extraction
This project presents two novel techniques to improve existing semantic role representations to enable better understanding of the language. First, we have tried to retrofit word vectors generated from an LSTM model with a scientific-processes corpus to generate better word embeddings. The second technique uses a semi-supervised model which learns word embeddings using role as context. On testing, we ...
Empirical Exploration of Novel Architectures and Objectives for Language Models
While recurrent neural network language models based on Long Short-Term Memory (LSTM) have shown good gains in many automatic speech recognition tasks, Convolutional Neural Network (CNN) language models are relatively new and have not been studied in depth. In this paper we present an empirical comparison of LSTM and CNN language models on English broadcast news and various conversational telep...
Journal: CoRR
Volume: abs/1708.02182
Pages: -
Year of publication: 2017